Brief Summary: Asynchronous Methods for Deep Reinforcement Learning (A3C)

Citation: Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu. Google DeepMind / MILA. ICML 2016.


Problem

Deep RL with neural networks was considered unstable when trained online, because consecutive observations in an agent's experience stream are strongly correlated and non-stationary. The standard fix, an experience replay memory, stabilizes training by breaking these correlations, but it (a) requires large memory, (b) incurs extra compute per step, and (c) works only with off-policy algorithms, which blocks the use of on-policy methods (Sarsa, actor-critic) with deep networks.

Core Insight

Running multiple independent actor-learners in parallel on different copies of the environment naturally decorrelates the training data without any replay buffer, because at any given moment the parallel agents are experiencing diverse states. This simple observation makes on-policy deep RL stable and enables a far wider class of algorithms (Sarsa, n-step Q-learning, actor-critic) to be trained successfully with neural networks. As a bonus, parallelism is achieved on a standard multi-core CPU — no GPU required.
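The decorrelation effect can be illustrated with a toy sketch. Here `ToyEnv` and the random action rule are hypothetical stand-ins for an environment and a shared policy, not the paper's actual setup:

```python
import threading
import random

class ToyEnv:
    """Toy environment: state is an integer position on a ring of 100 cells."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = self.rng.randrange(100)  # independent start per worker

    def step(self, action):
        self.state = (self.state + action) % 100
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward

def worker(worker_id, n_steps, out):
    env = ToyEnv(seed=worker_id)              # each worker owns its own env copy
    traj = []
    for _ in range(n_steps):
        action = env.rng.choice([-1, 1])      # stand-in for the shared policy
        state, _ = env.step(action)
        traj.append(state)
    out[worker_id] = traj

results = {}
threads = [threading.Thread(target=worker, args=(i, 50, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# At any fixed timestep the workers typically occupy different states, so the
# combined stream of transitions is far less correlated than a single agent's.
states_at_t0 = {results[i][0] for i in range(4)}
```

The same idea scales to real environments: the aggregate gradient signal the shared network sees is averaged over many uncorrelated trajectories at once.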

Method: Asynchronous Advantage Actor-Critic (A3C)

Four asynchronous algorithms are presented: one-step Q-learning, one-step Sarsa, n-step Q-learning, and the headline method A3C.

In A3C:

- Each of several parallel worker threads holds its own environment instance and a local copy of the shared global network: a policy pi(a|s; theta) and a value estimate V(s; theta_v), with most parameters shared between the two heads.
- A worker acts for up to t_max steps (or until a terminal state), then computes n-step returns, bootstrapping from V(s_{t_max}) for non-terminal rollouts.
- The policy gradient uses the advantage estimate A = R - V(s) as a baseline-corrected signal, plus an entropy regularization term that discourages premature convergence to deterministic policies.
- The critic is trained on the squared n-step return error (R - V(s))^2.
- Accumulated gradients are applied asynchronously to the shared parameters, Hogwild!-style (lock-free); a shared RMSProp optimizer whose statistics are shared across threads worked best in the paper's experiments.
- A variant replaces the feedforward network with an LSTM to handle partial observability.
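The per-rollout target computation at the heart of the A3C update can be sketched as follows (a minimal version in plain Python; variable names are illustrative):

```python
def a3c_targets(rewards, values, bootstrap_value, gamma=0.99):
    """Compute n-step returns and advantages for one rollout segment.

    rewards:         r_t for each step of the segment
    values:          the critic's V(s_t) estimates for the same steps
    bootstrap_value: V(s_{t_max}) if the segment did not end the episode, else 0
    """
    R = bootstrap_value
    returns, advantages = [], []
    # Walk backwards so each R accumulates all discounted future rewards.
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R
        returns.append(R)
        advantages.append(R - v)   # advantage estimate A = R - V(s)
    return list(reversed(returns)), list(reversed(advantages))

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]
returns, advs = a3c_targets(rewards, values, bootstrap_value=0.0)
# returns ≈ [0.9801, 0.99, 1.0]
```

Each worker would then scale log-probability gradients by `advs` (policy loss) and regress `values` toward `returns` (value loss) before pushing gradients to the shared parameters.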

Key Results

- On Atari, A3C surpassed the previous state of the art in half the training time, using 16 CPU cores instead of a GPU.
- All four asynchronous methods scale well with the number of worker threads; the one-step methods often show superlinear gains in data efficiency as threads are added.
- The approach generalizes beyond Atari: TORCS car racing, continuous motor control in MuJoCo (via a continuous-action variant), and 3D maze navigation in Labyrinth (using the LSTM variant).

Limitations

- Asynchronous, lock-free updates mean workers may compute gradients against stale parameters; stability is demonstrated empirically, not guaranteed theoretically.
- Being on-policy, A3C uses each transition once and discards it, so it is less sample-efficient than replay-based methods when environment interaction is expensive.
- Performance is sensitive to hyperparameters such as the learning rate, t_max, and the entropy regularization weight.


Relevance to DynamICCL

High direct relevance — this is a foundational algorithm paper for DynamICCL's RL design.

DynamICCL's Config Agent (Agent-2) uses DQN/RL to select NCCL parameters. A3C is a direct alternative to DQN, with several advantages for the DynamICCL setting:

  1. Parallelism for faster policy training: DynamICCL must train online in a live HPC cluster. A3C's multi-worker parallel training could allow simultaneous exploration of NCCL configurations across multiple concurrent collective operations, reducing wall-clock training time.

  2. On-policy stability without replay: DynamICCL's environment is non-stationary (congestion levels change), making replay-based methods potentially harmful (stale transitions from a different congestion regime). A3C's on-policy, replay-free approach is more robust to this non-stationarity.

  3. LSTM extension: A3C with LSTM is directly applicable to DynamICCL's Trigger Agent (Agent-1), which already uses LSTM+CUSUM for temporal pattern detection. An A3C-LSTM config agent could jointly learn to detect congestion and select NCCL parameters in a single recurrent actor-critic policy.

  4. Actor-Critic vs. DQN: The advantage function A(s,a) = Q(s,a) - V(s) in A3C subtracts a learned baseline V(s) from the return, yielding lower-variance policy-gradient estimates than using raw returns or Q-values directly. This matters when rewards in DynamICCL (collective completion-time deltas) have high variance due to network jitter.

  5. CPU-only training: DynamICCL runs on Chameleon Cloud bare-metal nodes; A3C's CPU-based training eliminates GPU dependency during the RL training phase itself.
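As a sketch of what an actor-critic config agent could look like in this setting, the snippet below builds a tiny linear policy/value head over a discrete space of tuning knobs. The knob names and the 4-feature congestion state are illustrative assumptions, not NCCL's actual API or DynamICCL's design:

```python
import numpy as np

# Hypothetical discrete action space: each action pairs an algorithm choice
# with a chunk size (names are illustrative stand-ins for NCCL-style knobs).
ACTIONS = [(algo, chunk) for algo in ("ring", "tree") for chunk in (128, 256, 512)]

def policy_and_value(state, W_pi, b_pi, W_v, b_v):
    """Linear actor-critic head: softmax policy over ACTIONS plus scalar V(s)."""
    logits = state @ W_pi + b_pi
    probs = np.exp(logits - logits.max())   # stable softmax
    probs /= probs.sum()
    value = float(state @ W_v + b_v)
    return probs, value

rng = np.random.default_rng(0)
state = rng.standard_normal(4)              # e.g. congestion features from Agent-1
W_pi, b_pi = rng.standard_normal((4, len(ACTIONS))), np.zeros(len(ACTIONS))
W_v, b_v = rng.standard_normal(4), 0.0

probs, value = policy_and_value(state, W_pi, b_pi, W_v, b_v)
action = ACTIONS[int(np.argmax(probs))]     # greedy pick shown for illustration
```

In an A3C-style agent, `probs` would instead be sampled from during training, and the advantage-weighted log-probability gradient would flow through `W_pi` while the n-step return error trains `W_v`.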